The Case For Dynamic Content



Flexibility vs Performance 


Scripting is great, but too slow for high performance tasks 

Even with JIT 

Some high performance areas that would benefit from increased flexibility 

Particle simulation 

Wind simulation (and other vector field effects) 

Sound processing 
...


How can we make scripting work for them? 


What We Want: 

Fully Scripted FX with Near Native Performance 

(programmer art) 



Existing Solutions 


Stack of C “modifiers” or “filters”

Good performance, limited flexibility 
Must get C programmers to add new filters 
Bad for generic & reusable engine 

Runtime code compile 

Promising but tricky to get right 
Need compilers for all platforms (server?) 

Need runtime linking on all platforms (iOS) 

Must be converted to static code for “final” release 


Artist selects filters and parameters in tool 
gravity(0,0,-9.82), whirlwind(0, 0, 5) 

Loop with switch statement in C code 
Apply one filter at a time 
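The modifier-stack approach described above can be sketched roughly as follows. This is only an illustration of "loop with switch statement, one filter at a time": the `Modifier` struct, the `MOD_*` tags and `apply_modifiers` are hypothetical names, and the whirlwind formula (a swirl around an axis through the origin) is an assumption, not taken from the talk.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z; } Vec3;

typedef enum { MOD_GRAVITY, MOD_WHIRLWIND } ModifierType;

/* One artist-selected filter plus its parameter (hypothetical layout). */
typedef struct {
    ModifierType type;
    Vec3 param;
} Modifier;

/* Apply every modifier in the stack, one filter at a time, to all n particles. */
void apply_modifiers(const Modifier *mods, size_t mod_count,
                     const Vec3 *pos, Vec3 *vel, size_t n, float dt)
{
    assert(mods && pos && vel);
    for (size_t m = 0; m < mod_count; ++m) {
        const Vec3 p = mods[m].param;
        switch (mods[m].type) {
        case MOD_GRAVITY:      /* vel += g * dt */
            for (size_t i = 0; i < n; ++i) {
                vel[i].x += p.x * dt;
                vel[i].y += p.y * dt;
                vel[i].z += p.z * dt;
            }
            break;
        case MOD_WHIRLWIND:    /* vel += cross(axis, pos) * dt (assumed formula) */
            for (size_t i = 0; i < n; ++i) {
                vel[i].x += (p.y * pos[i].z - p.z * pos[i].y) * dt;
                vel[i].y += (p.z * pos[i].x - p.x * pos[i].z) * dt;
                vel[i].z += (p.x * pos[i].y - p.y * pos[i].x) * dt;
            }
            break;
        }
    }
}
```

Each modifier makes a full pass over the particle data, and any new behavior needs a new `case` written by a C programmer, which is the flexibility limit the slide points at.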

Artist creates effect in tool 
Tool generates C code for running the effect 
C code gets compiled for target platforms 
Runtime linked with running executable 


Why Are Script Interpreters Slow? 


Virtual machine (per bytecode instruction)

Decode instruction
Jump to opcode
Execute instruction

Native (per instruction)

Execute instruction, e.g. addps xmm0, xmm1

The “Execute instruction” parts are identical: the same machine instruction runs in both cases.

Decoding and jumping to the opcode is the overhead of using bytecode instead of native.
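To make the overhead concrete, here is a minimal sketch of a conventional interpreter that runs the whole program once per item. The `Instr` layout, the opcode names and the four-slot register file are assumptions made for the example; the function returns the number of decode steps so the cost is visible.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OP_ADD, OP_MUL } Opcode;

/* dst = src1 OP src2, where operands index a small per-item register file. */
typedef struct { Opcode op; int dst, src1, src2; } Instr;

/* Runs the whole program once per item; returns the number of decode steps. */
size_t run_per_item(const Instr *prog, size_t prog_len,
                    float (*regs)[4], size_t item_count)
{
    size_t decodes = 0;
    for (size_t i = 0; i < item_count; ++i) {        /* outer loop: items */
        float *r = regs[i];
        for (size_t pc = 0; pc < prog_len; ++pc) {   /* inner loop: instructions */
            const Instr in = prog[pc];               /* decode */
            ++decodes;
            assert(in.dst >= 0 && in.dst < 4);
            switch (in.op) {                         /* jump to opcode */
            case OP_ADD: r[in.dst] = r[in.src1] + r[in.src2]; break;  /* execute */
            case OP_MUL: r[in.dst] = r[in.src1] * r[in.src2]; break;
            }
        }
    }
    return decodes;
}
```

With this loop order the decode and branch cost is paid `item_count * prog_len` times, once for every single arithmetic operation.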


Data Wide Virtual Machine 


Perform each instruction on MANY data items

Cost of decoding & branching is amortized
Byte code just as fast as native?

Can’t keep data in registers
More loads & stores
Touches more cache memory

movaps xmm0, xmmword ptr [edx]
addps xmm0, xmmword ptr [ecx]
movaps xmmword ptr [eax], xmm0

Virtual machine

Decode instruction
Jump to opcode
Execute instruction (repeated for each data item)

Native

Execute instruction (repeated for each data item)

Loop Orders 


Native Modifier stack 
How it Works:


Built on top of Vector4 intrinsic abstraction

Input/output data are channels (arrays of SIMD vectors)

Byte code contains instructions for 
operating on channels 

pos = ADD pos move 

After decoding instruction, 
interpreter applies it to n objects at 
a time 


Data-Wide Interpreter 



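// Handler for one decoded ADD instruction: a = b + c over n items
Vector4 *a = (decode channel ref);        // destination channel
const Vector4 *b = (decode channel ref);  // first source channel
const Vector4 *c = (decode channel ref);  // second source channel
Vector4 *ae = a + n;                      // end of the destination range
while (a < ae) {
    *a = *b + *c;
    ++a; ++b; ++c;
}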
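A self-contained version of the data-wide loop might look like the sketch below. It assumes a plain struct as a portable stand-in for the SIMD `Vector4` type, and a hypothetical `Instr` format whose operands are indices into a table of channel pointers; the real bytecode layout is not shown in the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Portable stand-in for the SIMD vector type (assumption: the real Vector4 wraps intrinsics). */
typedef struct { float x, y, z, w; } Vector4;

typedef enum { OP_ADD, OP_MUL } Opcode;

/* dst = a OP b; operands are indices into a table of channel pointers. */
typedef struct { Opcode op; int dst, a, b; } Instr;

Vector4 v4_add(Vector4 p, Vector4 q)
{
    Vector4 r = { p.x + q.x, p.y + q.y, p.z + q.z, p.w + q.w };
    return r;
}

Vector4 v4_mul(Vector4 p, Vector4 q)
{
    Vector4 r = { p.x * q.x, p.y * q.y, p.z * q.z, p.w * q.w };
    return r;
}

/* Decode each instruction once, then apply it to n items in a tight loop. */
void run_data_wide(const Instr *prog, size_t prog_len, Vector4 **channels, size_t n)
{
    for (size_t pc = 0; pc < prog_len; ++pc) {
        const Instr in = prog[pc];              /* decode once */
        Vector4 *d = channels[in.dst];
        const Vector4 *a = channels[in.a];
        const Vector4 *b = channels[in.b];
        assert(d && a && b);
        switch (in.op) {                        /* jump once */
        case OP_ADD:
            for (size_t i = 0; i < n; ++i) d[i] = v4_add(a[i], b[i]);
            break;
        case OP_MUL:
            for (size_t i = 0; i < n; ++i) d[i] = v4_mul(a[i], b[i]);
            break;
        }
    }
}
```

Here the decode and branch happen `prog_len` times in total, independent of `n`, which is the amortization the slides describe.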
Bytecode Details: Constants


Stored as a Vector4 in-place in the byte code

age = ADD_constant age (0.0 0.0 0.0 0.0) delta_time

Note: The bytecode uses a separate ADD_constant opcode when adding a channel and a constant

The compiler keeps track of the location of all bytecode constants

Before running the bytecode you patch the values of all constants

patch_constant(bytecode, hash("delta_time"), vector4(0.33, 0.33, 0.33, 0.33));
age = ADD_constant age (0.33 0.33 0.33 0.33) delta_time

When the bytecode runs, no lookup is needed for constants → maximum speed
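One way the patching step could work is sketched below. Everything beyond the idea itself is an assumption: the `ConstantRef` table stands in for "the compiler keeps track of the location of all bytecode constants", the extra table parameters are not in the slide's `patch_constant` call, and FNV-1a is used only as an example hash since the talk does not name one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct { float x, y, z, w; } Vector4;

/* Compiler-generated record: where in the bytecode a named constant lives. */
typedef struct { uint32_t name_hash; size_t byte_offset; } ConstantRef;

/* FNV-1a, chosen only for illustration. */
uint32_t hash_name(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
    return h;
}

/* Overwrite every in-place copy of the named constant; returns how many were patched. */
size_t patch_constant(unsigned char *bytecode,
                      const ConstantRef *refs, size_t ref_count,
                      uint32_t name_hash, Vector4 value)
{
    size_t patched = 0;
    assert(bytecode && refs);
    for (size_t i = 0; i < ref_count; ++i) {
        if (refs[i].name_hash == name_hash) {
            memcpy(bytecode + refs[i].byte_offset, &value, sizeof value);
            ++patched;
        }
    }
    return patched;
}
```

All name lookups happen here, once per frame, so the interpreter itself only ever reads a `Vector4` that is already sitting next to the opcode.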


Constants 



constant for all particles during frame 


age = age + delta_time; 


Bytecode Details: Temporary Variables

r0 = MUL vel delta_time
pos = ADD pos r0

• We use temporary Vector4 buffers for temporary (and local) variables 

To virtual machine, no distinction between temp buffers and channels 
Temp buffers do not have to be as big as the input channel 
Only as big as n, the number of items we process at a time 

• Balance between memory use and performance 

We want high n to amortize the cost of instruction decoding 
We want low n to minimize temporary memory use 
n = 128 is a decent compromise
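The chunking trade-off can be illustrated with a hand-written equivalent of the two-instruction program above. This is a sketch under assumptions: `integrate_chunked` and `CHUNK` are hypothetical names, and `Vector4` is a plain struct standing in for the SIMD type. With four floats per vector, one temporary of 128 items occupies 2 KB regardless of how many particles exist.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z, w; } Vector4;

#define CHUNK 128u   /* n from the slide */

/* pos = pos + vel * dt, evaluated one chunk at a time like the bytecode would be. */
void integrate_chunked(Vector4 *pos, const Vector4 *vel, Vector4 dt, size_t count)
{
    Vector4 r0[CHUNK];   /* temporary buffer: sized by the chunk, not by the channel */
    assert(pos && vel);
    for (size_t base = 0; base < count; base += CHUNK) {
        size_t n = (count - base < CHUNK) ? count - base : CHUNK;

        /* r0 = MUL vel delta_time */
        for (size_t i = 0; i < n; ++i) {
            const Vector4 v = vel[base + i];
            r0[i].x = v.x * dt.x; r0[i].y = v.y * dt.y;
            r0[i].z = v.z * dt.z; r0[i].w = v.w * dt.w;
        }
        /* pos = ADD pos r0 */
        for (size_t i = 0; i < n; ++i) {
            Vector4 *p = &pos[base + i];
            p->x += r0[i].x; p->y += r0[i].y;
            p->z += r0[i].z; p->w += r0[i].w;
        }
    }
}
```

A larger `CHUNK` means fewer passes through the outer loop (fewer decodes in a real interpreter), while a smaller one keeps `r0` and the touched channel ranges more likely to stay in cache.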


The Big Picture 


Offline 

Data compiler parses code 

pos = pos + vel * delta_time 

Generates bytecode, introduces temporary variables as necessary 

r0 = MUL vel (0.0 0.0 0.0 0.0) delta_time
pos = ADD pos r0

Bytecode is optimized (Temporary variable elimination) 

Runtime 

Patch the constants in the bytecode 
Execute the instructions 


Implementation Details 


• Very simple hand-written tokenizer and recursive descent parser

~1000 lines

• Trivial bytecode format

OPERATOR operand1 operand2

No packing/unpacking necessary, we do not need to optimize for bytecode size

• Very simple virtual machine implementation 


Big switch statement 
~250 lines 


Real-World Example 


// Source syntax inspired by HLSL
const float4 center = float4(0,0,0,0);
const float4 up = float4(0,0,1,0);
const float4 speed = float4(1,1,1,1);
const float4 radius = float4(5,5,5,5);

struct vf_in
{
    float4 position : CHANNEL0;
    float4 wind : CHANNEL1;
};

struct vf_out
{
    float4 wind : CHANNEL1;
};

void whirl(in vf_in in, out vf_out out)
{
    float4 r = in.position - center;
    out.wind = in.wind + speed * cross(up, r) / dot(r,r) * radius;
}

// Resulting bytecode
// r0, r1 correspond to CHANNEL0, CHANNEL1
// r2-r5 are temporary variables

r2 = SUB r0 (0,0,0,0) center
r3 = CROSS (0,0,1,0) up r2
r4 = MUL (1,1,1,1) speed r3
r3 = DOT r2 r2
r5 = DIV r4 r3
r3 = MUL r5 (5,5,5,5) radius
r1 = ADD r1 r3
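For checking the bytecode by hand, a scalar C reference of the same expression for a single item can be useful. This is a sketch, not engine code: `Float4`, `cross3`, `dot4` and `whirl_one` are hypothetical names, and it assumes that `cross` ignores the w component and that `dot` sums all four components.

```c
#include <assert.h>

typedef struct { float x, y, z, w; } Float4;

/* 3D cross product; w is set to zero (assumption). */
Float4 cross3(Float4 a, Float4 b)
{
    Float4 r = { a.y * b.z - a.z * b.y,
                 a.z * b.x - a.x * b.z,
                 a.x * b.y - a.y * b.x,
                 0.0f };
    return r;
}

float dot4(Float4 a, Float4 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

/* One item of the whirl effect; comments name the bytecode step each line mirrors. */
Float4 whirl_one(Float4 position, Float4 wind)
{
    const Float4 center = { 0, 0, 0, 0 };
    const Float4 up     = { 0, 0, 1, 0 };
    const float speed = 1.0f, radius = 5.0f;   /* splatted constants in the script */

    Float4 r = { position.x - center.x, position.y - center.y,
                 position.z - center.z, position.w - center.w };  /* r2 = SUB  */
    Float4 c = cross3(up, r);                                     /* r3 = CROSS */
    float d = dot4(r, r);                                         /* r3 = DOT   */
    assert(d > 0.0f);   /* the expression divides by zero at the center */

    /* MUL by speed, DIV by dot, MUL by radius, ADD to wind */
    Float4 out = { wind.x + c.x * speed / d * radius,
                   wind.y + c.y * speed / d * radius,
                   wind.z + c.z * speed / d * radius,
                   wind.w };
    return out;
}
```

Note how the bytecode reuses `r3` for three different intermediate values, which is the kind of temporary elimination the offline optimizer performs.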


Use Case 1: Wind Simulation


Wind is simulated as a superposition of 
effects 

Effect: Cull box + script with constants 
Script returns wind at position 

Evaluation at a large number of points 

Positions of particles and physics objects 

Apply culling to find relevant effects 

Merge the bytecode to a single function (for 
performance) 
Use Case 2: Particle Simulation


A “particle” is just a collection of channels 

Position, velocity, color, size, etc 

Editor completely defines what channels exist 

For example there could be two position channels (for a “beam” particle).

Particle effects written in vector language 

Editor allows using existing effects 
Or writing completely new ones 

We are transitioning to this system 

Runs in parallel with the old, less dynamic particle system





Comparing Performance to Native 


Example: 64K particles with gravity and one collision surface

~34 % overhead over native
~18 % overhead over modifiers

Source: loads & stores

void update(in vf in, out vf out)
{
    float4 vel = in.vel + gravity*dt;
    out.pos = in.pos + vel*dt;

    float4 collide = dot(in.pos - plane_p, plane_n) < 0;
    float4 travelling_down = dot(vel, plane_n) < 0;
    out.vel = vel - 2 * vel * collide * travelling_down;
}
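For comparison, a hand-written native version of the same update for one particle might look like this. It is a sketch with hypothetical names (`Vec3`, `dot3`, `update_particle`), and it assumes the script's `<` comparisons yield 1.0 or 0.0 so they can be multiplied as masks; the talk's actual native benchmark code is not shown.

```c
#include <assert.h>

typedef struct { float x, y, z; } Vec3;

float dot3(Vec3 a, Vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Native reference for the scripted update, for one particle. */
void update_particle(Vec3 *pos, Vec3 *vel, Vec3 gravity,
                     Vec3 plane_p, Vec3 plane_n, float dt)
{
    assert(pos && vel);

    /* vel = in.vel + gravity*dt */
    Vec3 v = { vel->x + gravity.x * dt,
               vel->y + gravity.y * dt,
               vel->z + gravity.z * dt };

    /* Masks use the old position, like in.pos in the script. */
    Vec3 rel = { pos->x - plane_p.x, pos->y - plane_p.y, pos->z - plane_p.z };
    float collide         = dot3(rel, plane_n) < 0.0f ? 1.0f : 0.0f;
    float travelling_down = dot3(v, plane_n)   < 0.0f ? 1.0f : 0.0f;

    /* vel - 2*vel*collide*travelling_down == vel * k, with k = 1 or -1 */
    float k = 1.0f - 2.0f * collide * travelling_down;

    /* out.pos = in.pos + vel*dt */
    pos->x += v.x * dt;
    pos->y += v.y * dt;
    pos->z += v.z * dt;

    vel->x = v.x * k;
    vel->y = v.y * k;
    vel->z = v.z * k;
}
```

In this form every intermediate stays in registers for the whole update, whereas the data-wide interpreter has to store and reload each temporary channel; that difference is where the measured overhead comes from.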




Compare to typical bytecode overhead: x10 - x20

                       Native     Modifiers    Scripted
Modern PC              0.402 ms   0.455 ms     0.539 ms
                       x1.0       x1.13        x1.34
X360/PS3 gen console   5.398 ms   10.196 ms    7.006 ms
                       x1.0       x1.89        x1.30

On the console, the modifier solution exhausts L2 cache
With smaller data set x1.28


But Wait — We Can Do Better! 


Rewrite the bytecode 
interpreter in AVX 


Process 8 floats at a time 
Now we run faster than native! 

Fair comparison? 

We could rewrite the native 
code in AVX as well 

But will you take the time to 
rewrite all your handwritten 
code to use AVX? 

Will you maintain multiple
versions for SSE, AVX, Neon, etc?



             Native     Modifiers   Scripted    AVX
Modern PC    0.402 ms   0.455 ms    0.539 ms    0.373 ms
             x1.0       x1.13       x1.34       x0.92



Conclusions 


The “data wide interpreter” model is a viable solution for high-performance scripting

Completely configurable behaviors 

Fully dynamic: can be quickly reloaded, no engine recompile necessary 
18 % overhead over traditional modifier stack solution (34 % over native) 
AVX-enabled scripted solution is faster than the native solution

Future 

One channel per component (pos.x, pos.y, pos.z) 

More backends: JIT compiler, GPU Compute, SPU... 
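One possible reading of "one channel per component" is a structure-of-arrays layout, where `pos.x`, `pos.y` and `pos.z` each get their own float channel. The sketch below assumes that reading; `Vec3Channels` and `dot_channels` are hypothetical names, not part of the system described.

```c
#include <assert.h>
#include <stddef.h>

/* Structure-of-arrays: one float channel per component. */
typedef struct {
    float *x, *y, *z;
} Vec3Channels;

/* out[i] = dot(a[i], b[i]). Every element works on a different item, so a
   vectorized version of this loop needs no cross-lane (horizontal) add. */
void dot_channels(Vec3Channels a, Vec3Channels b, float *out, size_t n)
{
    assert(out);
    for (size_t i = 0; i < n; ++i)
        out[i] = a.x[i] * b.x[i] + a.y[i] * b.y[i] + a.z[i] * b.z[i];
}
```

With this layout a 3-component quantity also wastes no fourth lane, and widening the backend from 4 to 8 floats per instruction does not change the data format.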